Supplementary Material for Learning Robust Visual-Semantic Embeddings

Authors

  • Yao-Hung Hubert Tsai
  • Liang-Kang Huang
  • Ruslan Salakhutdinov
Abstract

Fig. 1 provides an overview of the ReViSE design. In all of our experiments, GoogLeNet is pre-trained on ImageNet [2] images. Without fine-tuning, we directly extract the top-layer activations (1024-dim) as our input image features, followed by a common log(1 + v) pre-processing step. The textual attributes are pre-processed with a standard l2 normalization. In ReViSE, we set α = 1.0 in eq. (11), so that equal importance is placed on the supervised and unsupervised objectives. For the visual auto-encoder, we fix the contraction strength to γ = 0.1 in eq. (2).

In the following, we omit the bias term in each layer for simplicity. The visual encoder is a two-hidden-layer fully-connected neural network with architecture dv1−dv2−dc, where dv1 = 1024 is the input dimension of the visual features, dv2 = 500 is the size of the intermediate layer, and dc denotes the dimension of the visual codes ṽh. The textual encoder is a single-hidden-layer neural network dt1−dc, where dt1 is the input dimension of the textual attributes. We choose dc = 100 when dt1 > 100 and dc = 75 when dt1 < 100. The decoder weights are learned independently of the encoder weights, i.e., the weights are not tied.

The parameters for associating the distributions of visual and textual codes (the MMD loss) in eqs. (5) and (6) are set as β ∈ {0.1, 1.0} (chosen by cross-validation) and κ = 32.0. For the remaining part of the model, the visual and textual code mappings are single-hidden-layer fully-connected networks with dimension dc−50, and we adopt a dropout rate of 0.7.

During the first 100 iterations of training, we set λ = 0 so that no unsupervised-data adaptation is used, while still updating Î^(ut)_{i,c}. Note that Î^(ut)_{i,c} are the inferred labels for the unsupervised data, and are not random at each iteration. Beginning with the 101st iteration, we set λ ∈ {0.1, 1.0} (chosen by cross-validation), and the model typically converges within 2000 to 5000 iterations. We implement ReViSE in TensorFlow [1] and use Adam [3] for optimization with minibatches of size 1024.
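As a minimal sketch of the pre-processing and encoder/decoder setup described above (not the authors' released code): the use of tf.keras, the tanh activations, and the example attribute dimension d_t1 = 312 are assumptions, while the layer sizes, the log(1 + v) and l2 pre-processing, and the untied decoder weights follow the text.

```python
import numpy as np
import tensorflow as tf

d_v1, d_v2 = 1024, 500           # visual input and intermediate sizes from the text
d_t1 = 312                        # textual attribute dim; dataset-dependent (an assumption here)
d_c = 100 if d_t1 > 100 else 75   # code-dimension rule from the text

def preprocess_visual(v):
    # log(1 + v) on the 1024-dim GoogLeNet top-layer activations
    return np.log1p(v)

def preprocess_textual(t):
    # standard l2 normalization of the textual attributes
    return t / (np.linalg.norm(t, axis=1, keepdims=True) + 1e-12)

# Visual encoder: d_v1 - d_v2 - d_c (two hidden layers). The nonlinearity is
# not named in the text, so tanh is an assumption. Biases are kept here even
# though the text omits them for notational simplicity.
visual_encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(d_v1,)),
    tf.keras.layers.Dense(d_v2, activation="tanh"),
    tf.keras.layers.Dense(d_c, activation="tanh"),
])

# Textual encoder: d_t1 - d_c (single hidden layer).
textual_encoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(d_t1,)),
    tf.keras.layers.Dense(d_c, activation="tanh"),
])

# The decoder mirrors the encoder but with untied, independently learned weights.
visual_decoder = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(d_c,)),
    tf.keras.layers.Dense(d_v2, activation="tanh"),
    tf.keras.layers.Dense(d_v1),
])
```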

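The MMD association between visual and textual codes can be estimated per minibatch as sketched below. This is a generic kernel two-sample estimator: β and κ = 32.0 come from the text, but the Gaussian kernel form, with κ in the bandwidth role, is an assumption, since the exact kernel is not restated in this supplement.

```python
import tensorflow as tf

def mmd_loss(x, y, kappa=32.0):
    """Squared MMD between a batch of visual codes x and textual codes y.

    Assumes a Gaussian kernel k(a, b) = exp(-||a - b||^2 / kappa); only the
    constants beta (applied outside) and kappa = 32.0 come from the text.
    """
    def gram(a, b):
        # pairwise squared distances via ||a||^2 - 2 a.b + ||b||^2
        sq = (tf.reduce_sum(a**2, 1, keepdims=True)
              - 2.0 * tf.matmul(a, b, transpose_b=True)
              + tf.transpose(tf.reduce_sum(b**2, 1, keepdims=True)))
        return tf.exp(-sq / kappa)
    return (tf.reduce_mean(gram(x, x))
            + tf.reduce_mean(gram(y, y))
            - 2.0 * tf.reduce_mean(gram(x, y)))

# Usage: the resulting term enters the objective scaled by beta in {0.1, 1.0}.
x = tf.random.normal([1024, 100])   # visual codes for one minibatch
y = tf.random.normal([1024, 100])   # textual codes for one minibatch
print(float(mmd_loss(x, y)))
```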

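The λ warm-up schedule and optimization setup translate into a training-loop skeleton like the one below. Only the iteration counts, the λ and α values, the Adam optimizer, and the minibatch size of 1024 come from the text; the loss composition noted in the comments is a hypothetical placeholder.

```python
import tensorflow as tf

BATCH_SIZE = 1024   # minibatch size from the text
LAMBDA_CV = 0.1     # chosen from {0.1, 1.0} by cross-validation
ALPHA = 1.0         # equal weight on supervised and unsupervised objectives

def lambda_schedule(step):
    # 0-based step: iterations 1-100 run with lambda = 0 (no unsupervised-data
    # adaptation, although the inferred labels I^(ut) are still updated);
    # from iteration 101 onward the cross-validated value is used.
    return 0.0 if step < 100 else LAMBDA_CV

optimizer = tf.keras.optimizers.Adam()  # Adam [3]; default settings assumed

for step in range(5000):  # typically converges within 2000-5000 iterations
    lam = lambda_schedule(step)
    # Here one would draw a minibatch of BATCH_SIZE examples, form
    # supervised + ALPHA * unsupervised + lam * adaptation losses, and
    # apply the gradients with optimizer.apply_gradients(...).
```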
